AITopics | training recurrent neural network

On the Convergence Rate of Training Recurrent Neural Networks

Neural Information Processing Systems, Dec-25-2025, 00:46:13 GMT

How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper? In this paper, we focus on recurrent neural networks (RNNs), which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the \emph{same} recurrent unit is repeatedly applied across the entire time horizon of length $L$, which is analogous to feedforward networks of depth $L$. We show that when the number of neurons is sufficiently large, meaning polynomial in the training data size and in $L$, SGD is capable of minimizing the regression loss at a linear convergence rate. This gives theoretical evidence of how RNNs can memorize data. More importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. For instance, we prove why ReLU activations can prevent exponential gradient explosion or vanishing, and build a perturbation theory to analyze first-order approximation of multi-layer networks.

convergence rate, name change, training recurrent neural network, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)

Reviews: On the Convergence Rate of Training Recurrent Neural Networks

Neural Information Processing Systems, Feb-11-2025, 21:03:48 GMT

This paper shows that GD/SGD can minimize the training loss of RNNs at a linear convergence rate, assuming the hidden layer width is sufficiently large (polynomial in the data size and the time horizon length). To prove this, the authors show that within a small region around the initialization, the squared norm of the gradient is lower bounded by the function value (Theorem 3). The authors further show that the loss function is somewhat smooth (Theorem 4), which guarantees that moving in the negative gradient direction decreases the function value. The paper also builds new techniques to analyze multi-layer ReLU networks and shows that, with appropriate initialization, ReLU activations avoid exponential gradient explosion and vanishing.
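The argument summarized in this review (a gradient lower bound plus a smoothness property gives a linear rate) can be illustrated on a toy problem. The sketch below is not the paper's RNN setting: it assumes a simple quadratic loss, ordinary smoothness in place of the paper's "somewhat smooth" condition, and hypothetical names (`CURVATURES`, `run_gd`). For this loss the squared gradient norm is at least 2 * mu * f(w) with mu = 1, the smoothness constant is 4, and a step size of 1/4 shrinks the loss by a factor of at most 1 - mu/4 = 0.75 per step.

```python
# Toy quadratic loss f(w) = 0.5 * sum(a_i * w_i^2); its minimum value is 0.
# CURVATURES is a hypothetical choice: smallest value mu = 1, largest value 4.
CURVATURES = [1.0, 4.0]
STEP_SIZE = 1.0 / max(CURVATURES)  # 1 / (smoothness constant)


def loss(w):
    return 0.5 * sum(a * x * x for a, x in zip(CURVATURES, w))


def gradient(w):
    return [a * x for a, x in zip(CURVATURES, w)]


def run_gd(w0, steps):
    """Run plain gradient descent and return the loss after every step."""
    w = list(w0)
    losses = [loss(w)]
    for _ in range(steps):
        g = gradient(w)
        w = [x - STEP_SIZE * gx for x, gx in zip(w, g)]
        losses.append(loss(w))
    return losses


if __name__ == "__main__":
    history = run_gd([1.0, 1.0], steps=10)
    for step, (before, after) in enumerate(zip(history, history[1:]), start=1):
        # Each ratio stays at or below 0.75, i.e. the loss decays geometrically.
        print(step, after, after / before)
```

A geometric decay of the loss like this is what "linear convergence rate" refers to; the paper's contribution is showing that analogous conditions hold near the initialization for sufficiently wide RNNs.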

initialization, step size, training recurrent neural network, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)

Reviews: On the Convergence Rate of Training Recurrent Neural Networks

Neural Information Processing Systems, Feb-11-2025, 21:03:38 GMT

This paper proves poly-time convergence of SGD/GD in over-parametrized RNNs for the first time. Given that there are not many theoretical results in this space, all reviewers find this result to be significant progress.

convergence rate, training recurrent neural network

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)

On the Convergence Rate of Training Recurrent Neural Networks

Neural Information Processing Systems, Jan-21-2025, 14:09:51 GMT

How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper? In this paper, we focus on recurrent neural networks (RNNs), which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the \emph{same} recurrent unit is repeatedly applied across the entire time horizon of length $L$, which is analogous to feedforward networks of depth $L$.

convergence rate, multi-layer network, training recurrent neural network, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

Exploring Flip Flop memories and beyond: training recurrent neural networks with key insights

Jarne, Cecilia

arXiv.org Artificial Intelligence, Jul-29-2023

Training neural networks to perform different tasks is relevant across various disciplines. In particular, Recurrent Neural Networks (RNNs) are of great interest in Computational Neuroscience. Open-source frameworks dedicated to Machine Learning, such as TensorFlow and Keras, have produced significant changes in the development of technologies that we currently use. This work aims to make a significant contribution by comprehensively investigating and describing the implementation of a temporal processing task, specifically a 3-bit Flip Flop memory. We delve into the entire modelling process, encompassing equations, task parametrization, and software development. The obtained networks are meticulously analyzed to elucidate dynamics, aided by an array of visualization and analysis tools. Moreover, the provided code is versatile enough to facilitate the modelling of diverse tasks and systems. Furthermore, we present how memory states can be efficiently stored in the vertices of a cube in the dimensionally reduced space, supplementing previous results with a distinct approach.
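As a rough illustration of what a 3-bit Flip Flop task can look like, the sketch below generates one training trial. The abstract does not spell out the task parametrization, so everything here is an assumption: each of three input channels occasionally receives a +1 or -1 pulse, and the matching target channel holds the sign of the most recent pulse. The function name `make_flip_flop_trial` and its parameters are hypothetical, not taken from the paper's code.

```python
import random


def make_flip_flop_trial(n_steps=50, n_bits=3, pulse_prob=0.1, seed=0):
    """Build one trial as (inputs, targets), each a list of n_steps rows.

    Assumed task definition: a pulse of -1 or +1 on an input channel sets
    the remembered state of that bit; with no pulse the state is held.
    """
    rng = random.Random(seed)
    state = [0.0] * n_bits  # remembered bits, 0.0 until the first pulse
    inputs, targets = [], []
    for _ in range(n_steps):
        pulses = []
        for bit in range(n_bits):
            if rng.random() < pulse_prob:
                pulse = rng.choice([-1.0, 1.0])
                state[bit] = pulse  # a pulse overwrites the stored bit
            else:
                pulse = 0.0  # no pulse: the stored bit is kept
            pulses.append(pulse)
        inputs.append(pulses)
        targets.append(list(state))
    return inputs, targets


if __name__ == "__main__":
    xs, ys = make_flip_flop_trial(n_steps=10)
    for x, y in zip(xs, ys):
        print(x, "->", y)
```

Once pulses have arrived on all three channels, the target takes one of 2^3 = 8 sign patterns, which is consistent with the abstract's picture of memory states sitting at the vertices of a cube.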

artificial intelligence, machine learning, neural network, (16 more...)

arXiv.org Artificial Intelligence

2010.07858

Country:

South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Denmark > Central Jutland > Aarhus (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

On the Convergence Rate of Training Recurrent Neural Networks

Allen-Zhu, Zeyuan, Li, Yuanzhi, Song, Zhao

Neural Information Processing Systems, Mar-18-2020, 23:16:34 GMT

How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper? In this paper, we focus on recurrent neural networks (RNNs), which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the \emph{same} recurrent unit is repeatedly applied across the entire time horizon of length $L$, which is analogous to feedforward networks of depth $L$.

convergence rate, multi-layer network, training recurrent neural network, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

[R] Training Recurrent Neural Networks as a Constraint Satisfaction Problem • r/MachineLearning

@machinelearnbot, Mar-27-2018, 22:35:47 GMT

Obviously not the paper author, but this looks quite interesting. Mostly the fact that it finds all the local minima and can thus select the global minimum from them is nice. Though it would have been nice to see what the tradeoff is in terms of computational space and time complexity compared to error backpropagation.

deep learning, machine learning, training recurrent neural network, (3 more...)

@machinelearnbot

Industry: Media > News (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)

What is Teacher Forcing for Recurrent Neural Networks? - Machine Learning Mastery

#artificialintelligence, Mar-1-2018, 19:21:04 GMT

Teacher forcing is a method for quickly and efficiently training recurrent neural network models that use the output from a prior time step as input. It is a network training method critical to the development of deep learning language models used in machine translation, text summarization, and image captioning, among many other applications. In this post, you will discover teacher forcing as a method for training recurrent neural networks.
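The difference between teacher forcing and feeding a model its own outputs can be sketched without any deep learning library. In the example below, `decode` and `noisy_successor` are hypothetical names, and the "model" is a stand-in function that predicts the next integer but makes one deliberate mistake; it is only meant to show which value is fed back at each step.

```python
def decode(step_fn, target, start_token, teacher_forcing):
    """Predict a sequence one step at a time.

    With teacher_forcing=True the next input is the ground-truth token;
    otherwise the next input is the model's own previous prediction.
    """
    prev = start_token
    predictions = []
    for truth in target:
        pred = step_fn(prev)
        predictions.append(pred)
        prev = truth if teacher_forcing else pred
    return predictions


def noisy_successor(token):
    # Hypothetical imperfect model: predicts token + 1, but errs on input 2.
    return token + 2 if token == 2 else token + 1


if __name__ == "__main__":
    target = [1, 2, 3, 4, 5]
    print(decode(noisy_successor, target, 0, teacher_forcing=True))
    # -> [1, 2, 4, 4, 5]  (one wrong step, then back on track)
    print(decode(noisy_successor, target, 0, teacher_forcing=False))
    # -> [1, 2, 4, 5, 6]  (the early error carries into later steps)
```

With teacher forcing, the single mistake does not affect later inputs, which is one intuition for why it can make training faster and more stable.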

artificial intelligence, machine learning, sequence, (14 more...)

#artificialintelligence

Genre: Instructional Material > Course Syllabus & Notes (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Gentle Introduction to Exploding Gradients in Neural Networks - Machine Learning Mastery

#artificialintelligence, Dec-18-2017, 04:31:37 GMT

Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training. This makes your model unstable and unable to learn from your training data. In this post, you will discover the problem of exploding gradients with deep artificial neural networks. An error gradient is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount.
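A minimal sketch of how gradients can accumulate, assuming a scalar linear recurrence h_t = w * h_{t-1} (a simplification chosen for illustration, not a model from the post). Backpropagating through n steps multiplies the gradient by w once per step, so it scales like w**n. The names `backprop_factor` and `LEARNING_RATE` are hypothetical.

```python
LEARNING_RATE = 0.01  # hypothetical value, only used to show the update size


def backprop_factor(recurrent_weight, n_steps):
    """Return d h_n / d h_0 for the recurrence h_t = recurrent_weight * h_{t-1}."""
    grad = 1.0
    for _ in range(n_steps):
        grad *= recurrent_weight  # one chain-rule factor per time step
    return grad


if __name__ == "__main__":
    for w in (0.5, 1.0, 1.5):
        grad = backprop_factor(w, n_steps=20)
        # A weight above 1 gives a gradient in the thousands after 20 steps,
        # so even a small learning rate produces a very large update.
        print(f"w={w}: gradient factor={grad:.3g}, update={LEARNING_RATE * grad:.3g}")
```

In this toy setting a weight of 1.5 produces a factor above 3000 after 20 steps, while a weight of 0.5 produces a factor close to zero, the mirror-image vanishing case.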

artificial intelligence, deep learning, machine learning, (15 more...)

#artificialintelligence

Genre: Instructional Material > Course Syllabus & Notes (0.36)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
